Abstract

Iterative processing is now widely adopted in modern wireless receivers for advanced channel codes such as turbo and LDPC codes. Extending this principle with an additional iterative feedback loop to the demapping function has been shown to provide substantial error performance gains. However, the adoption of iterative demodulation with turbo decoding is constrained by the additional implementation complexity it implies, which heavily impacts latency and power consumption. In this paper, we analyze the convergence speed of these two combined iterative processes in order to determine the exact required number of iterations at each level. Extrinsic information transfer (EXIT) charts are used for a thorough analysis at different modulation orders and code rates. An original iteration scheduling is proposed that removes two demapping iterations with a reasonable performance loss of less than 0.15 dB. Analyzing and normalizing the computational and memory access complexity, which directly impact latency and power consumption, demonstrates the considerable gains of the proposed scheduling and the promising contributions of the proposed analysis.

I. Introduction

Advanced wireless communication standards impose the use of modern techniques to improve spectral efficiency and reliability. Among these techniques, Bit-Interleaved Coded Modulation (BICM) with different modulation orders and turbo codes with various code rates are frequently adopted.

The BICM principle [1] currently represents the state of the art in coded modulations over fading channels. The Bit-Interleaved Coded Modulation with Iterative Demapping (BICM-ID) scheme proposed in [2] is based on BICM with additional soft feedback from the Soft-Input Soft-Output (SISO) convolutional decoder to the constellation demapper.

In this context, several techniques and configurations have been explored. In [3], the authors investigated different mapping techniques suited for BICM-ID and QAM16 constellations, and proposed several mapping schemes providing significant coding gains. In [4], the convolutional code classically used in BICM-ID schemes was replaced by a turbo code. Only a small gain of 0.1 dB was observed. This result may make BICM-ID with turbo-like coding solutions (TBICM-ID) unsatisfactory with respect to the added decoding complexity. On the other hand, the authors in [5] presented a technique intended to improve the performance of TBICM-ID over non-Gaussian channels. The proposed technique, namely Signal Space Diversity (SSD), consists of a rotation of the constellation followed by a signal space component interleaving. It has shown additional error correction at the receiver side in an iterative processing scenario. Constellation rotation makes it possible to exploit higher code rates and to solve potential problems in selective channels while keeping good performance. It has been proposed for all constellation orders of Quadrature Amplitude Modulation (QAM). Combining constellation rotation with signal space component interleaving leads to significant improvement in performance over fading and erasure channels. It increases the diversity order of a communication system without using extra bandwidth. BICM coupled with SSD has been extensively studied for single-carrier systems, e.g. in [6] and the references therein.

Application of iterative demapping in this context has shown excellent error rate performance, particularly in severe channel conditions (erasure, multi-path, real fading models) [8]. In that work, LDPC channel coding was considered.

Nevertheless, most existing works have not considered these techniques from an implementation perspective. In fact, applying iterative demapping in future receivers that use advanced iterative channel decoding will lead to further latency problems, more power consumption, and more complexity, caused by the feedback loops both inside and outside the decoder. Besides the extrinsic information exchanged inside the iterative channel decoder, additional extrinsic information is fed back as a priori information used by the demapper to improve the symbol-to-bit conversion. The number of iterations to be run at each level should be determined accurately, as it significantly impacts not only error rate performance but also latency, power consumption, and complexity.

This work discusses the implementation efficiency of an iterative receiver based on turbo demodulation and turbo decoding, in order to achieve a gain in band-limited wireless communication systems. Convergence speed is analyzed for various system configurations to determine the exact required number of iterations at each level. Significant complexity reductions can be achieved by means of the proposed original iteration scheduling. Finally, an accurate normalized complexity analysis is presented in terms of arithmetic and memory access operations.

II. System model and Algorithms 
A. System Model 

The considered system uses one transmit and one receive antenna and assumes perfect synchronization. Fig. 1 shows a basic transmitter and receiver model using turbo demodulation and decoding. We denote by TBICM-ID-SSD a turbo BICM with iterative demapping coupled with signal space diversity.

Fig. 1: System model with TBICM-ID-SSD.

On the transmitter side, information bits U, which are called systematic bits, are regrouped into symbols $U_p$ consisting of l bits and encoded with an l-binary turbo encoder. It consists of a parallel concatenation of two identical convolutional codes (PCCC). The output codeword C is then punctured to reach a desired coding rate $R_c$. The 8-state double binary (l = 2) recursive systematic convolutional (RSC) code adopted in the WiMAX standard is considered for the turbo encoder.

In order to gain resilience against error bursts, the resulting sequence is interleaved using an S-random interleaver [12] with $S = \sqrt{N/2}$. The punctured and interleaved bits, denoted by V, are then Gray mapped to channel symbols $s_q$ chosen from a $2^M$-ary constellation X, where M is the number of bits per modulated symbol.

Applying the SSD consists first of rotating the mapped symbols $s_q$. The resulting rotated symbols are denoted by $s_{r,q}$. The performance gain obtained when using a rotated constellation $X_r$ depends on the choice of the rotation angle. In this regard, a thorough analysis has been done for the second-generation terrestrial transmission system developed by the DVB Project (DVB-T2), which adopted the rotated constellation technique. A single rotation angle [8] has been chosen for each constellation size independently of the channel type. Using these angles, and with an LDPC code, gains of 0.5 dB and 6 dB were shown at high code rate for the Rayleigh fading channel without and with erasures respectively [8]. The second step consists of signal space component interleaving. When a constellation signal is subjected to a fading event, its in-phase component I and quadrature component Q fade identically and suffer an irreversible loss. A means of avoiding this loss is to make I and Q fade independently while each carries all the information regarding the transmitted symbol. By inserting an interleaver (a simple delay for an uncorrelated fast-fading channel) between the I and Q channels, the diversity order is doubled.
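The two SSD steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the cyclic wrap-around of the Q delay, and the 29 degree angle used in the usage note are assumptions made only for illustration.

```python
import cmath
import math


def apply_ssd(symbols, angle_rad, delay=1):
    """Rotate each mapped symbol, then delay the Q component by `delay` symbols.

    The cyclic wrap-around of the delayed Q component is an assumption of this
    sketch; it keeps the frame length unchanged.
    """
    rotation = cmath.exp(1j * angle_rad)
    rotated = [s * rotation for s in symbols]
    n = len(rotated)
    # Symbol k carries its own I component and the Q component of symbol k - delay.
    return [complex(rotated[k].real, rotated[(k - delay) % n].imag) for k in range(n)]


def undo_q_delay(received, delay=1):
    """Receiver side: re-shift the Q components so I and Q of a symbol line up again."""
    n = len(received)
    return [complex(received[k].real, received[(k + delay) % n].imag) for k in range(n)]
```

For example, `undo_q_delay(apply_ssd(frame, math.radians(29.0)))` returns the rotated symbols of `frame` with I and Q realigned. On a fading channel the two components of each symbol would by then have experienced different fading coefficients.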

The symbols $s'_{r,q}$ are then transmitted over a noisy Rayleigh fast-fading channel with or without erasures. Erasure events occur in single frequency networks (SFN), as in the DVB-T2 standard, due to destructive interference. Each received symbol $x'_{r,q}$ is affected by a different fading coefficient, an erasure coefficient, and additive Gaussian noise.

The channel model considered is a frequency non-selective memoryless channel with erasure probability. The received discrete-time baseband complex signal can be written as:

$$x'_{r,q} = h'_q \, s'_{r,q} + n_q \qquad (1)$$

where $h_q$ is the Rayleigh fast-fading coefficient and $\rho_q$ is the erasure coefficient, taking the value 0 with probability $P_\rho$ and the value 1 with probability $1 - P_\rho$. $n_q$ is a complex white Gaussian noise with spectral density $N_0/2$ in each component axis, and $h'_q$ is the channel attenuation. Note that, at the receiver side, the transmitted energy has to be normalized by a factor of $\sqrt{1 - P_\rho}$ in order to cope with the loss of transmitted power.
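As a rough illustration of this channel model, the following Python sketch draws one attenuation value per symbol and adds complex Gaussian noise. It assumes that the attenuation is the product of the fading and erasure coefficients and that the fading has unit average power. Neither assumption is stated explicitly above, and the function name is hypothetical.

```python
import math
import random


def rayleigh_erasure_channel(symbols, n0, p_erasure, rng):
    """Return (received symbols, attenuations) for x = h' * s + n.

    Assumptions of this sketch: h' = h * rho, with h a Rayleigh amplitude of
    unit mean square and rho equal to 0 with probability p_erasure, else 1.
    """
    sigma = math.sqrt(n0 / 2.0)  # noise standard deviation per component axis
    received, attenuations = [], []
    for s in symbols:
        h = math.hypot(rng.gauss(0.0, math.sqrt(0.5)), rng.gauss(0.0, math.sqrt(0.5)))
        rho = 0.0 if rng.random() < p_erasure else 1.0
        h_prime = h * rho
        noise = complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
        received.append(h_prime * s + noise)
        attenuations.append(h_prime)
    return received, attenuations
```

With an erasure probability of 0.15, the mean square of the returned attenuations is close to 0.85, which is the transmitted power loss that the receiver-side normalization compensates for.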

B. Max-Log-MAP Demapping Algorithm 

At the receiver side, the complex received symbols $x'_{r,q}$ have their Q components re-shifted, resulting in $x_{r,q}$. An extrinsic log-likelihood ratio $L_{ext,Dem}(c_{p,q} / x_{r,q})$ is calculated for each bit $c_{p,q}$, the p-th bit of the received rotated and modulated symbol $x_{r,q}$. After de-interleaving, de-puncturing, and turbo decoding, the extrinsic information from the turbo decoder $L_{ext,Dec}(c_{p,q})$ is passed through the interleaver, punctured, and fed back as a priori information $L_{apr,Dem}(c_{p,q})$ to the demapper in a turbo demapping scheme. $L_{ext,Dem}(c_{p,q} / x_{r,q})$ is the difference between the soft a posteriori output $L_{Dem}(c_{p,q} / x_{r,q})$ and $L_{apr,Dem}(c_{p,q})$ at the demapper side. It is given by the expression below:

$$L_{ext,Dem}(c_{p,q} / x_{r,q}) = L_{Dem}(c_{p,q} / x_{r,q}) - L_{apr,Dem}(c_{p,q}) = \log\left(\frac{Z_1}{Z_0}\right) \qquad (2)$$

$Z_{l\,(l=0,1)}$ can be expressed as:

$$Z_{l\,(l=0,1)} = \sum_{s_{r,q} \in X^p_l} e^{-A_q} \prod_{i \neq p} P(c_{i,q}) \qquad (3)$$

where $X^p_l$, with $l \in \{0,1\}$, are the symbol sets of the constellation for which symbols have their p-th bit equal to l. $P(c_{i,q})$ is the probability of the i-th bit of constellation symbol $s_{r,q}$, computed through the a priori information $L_{apr,Dem}(c_{i,q})$. The complexity of the expressions above can be reduced by applying the max-log approximation. Thus, equation (2) can be written as:

$$L_{ext,Dem}(c_{p,q} / x_{r,q}) = \min_{s_{r,q} \in X^p_0}(A_q - B_{p,q}) - \min_{s_{r,q} \in X^p_1}(A_q - B_{p,q}) \qquad (4)$$

where $A_q$ and $B_{p,q}$ are computed as follows.

$$B_{p,q} = \left(\sum_{i=0}^{M-1} L_{apr,Dem}(c_{i,q})\right) - L_{apr,Dem}(c_{p,q}) \qquad (6)$$

In fact, the above demapping equations are valid for both channel models (with or without erasures) through the use of the $h'$ coefficient (equation (1)).

These simplified expressions exhibit three main computation steps: (a) Euclidean distance computation, denoted by $A_q$, (b) a priori adder, denoted by $B_{p,q}$, and (c) minimum finder, denoted by the min operation of equation (4).
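The three computation steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The Euclidean distance is assumed to be the sum of the squared I and Q errors, each with its own fading coefficient, divided by the noise variance. The a priori term is assumed to sum the LLRs of the bits equal to 1 in the candidate symbol, excluding bit p. The function and constant names are hypothetical.

```python
import math


def maxlog_demap(x, h_i, h_q, sigma2, constellation, l_apr):
    """Max-log extrinsic LLRs of equation (4) for one received symbol.

    constellation: list of (complex symbol, tuple of bits).
    l_apr: a priori LLRs, one per bit; a positive value favors bit 1.
    """
    m = len(l_apr)
    l_ext = []
    for p in range(m):
        best = [math.inf, math.inf]  # running minima over X_0^p and X_1^p
        for s, bits in constellation:
            # (a) Euclidean distance (assumed form, I and Q fade independently)
            a_q = ((x.real - h_i * s.real) ** 2 + (x.imag - h_q * s.imag) ** 2) / sigma2
            # (b) a priori adder (assumed: LLRs of the bits set to 1, bit p excluded)
            b_pq = sum(l_apr[i] for i in range(m) if i != p and bits[i] == 1)
            # (c) minimum finder
            best[bits[p]] = min(best[bits[p]], a_q - b_pq)
        l_ext.append(best[0] - best[1])
    return l_ext


A = 1 / math.sqrt(2)
# Hypothetical Gray-mapped QPSK: bit value 1 maps to the negative amplitude.
QPSK = [(complex(A * (1 - 2 * b0), A * (1 - 2 * b1)), (b0, b1))
        for b0 in (0, 1) for b1 in (0, 1)]
```

Calling `maxlog_demap(complex(-A, A), 1.0, 1.0, 0.5, QPSK, [0.0, 0.0])` gives a positive LLR for the first bit and a negative LLR for the second, as expected for a noiseless reception of the symbol labeled (1, 0).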

C. Max-Log-MAP Decoding Algorithm 

Following the demapping function at the receiver side, the turbo decoding algorithm is applied. The BCJR algorithm is considered for the Soft-Input Soft-Output (SISO) convolutional decoders. Using input symbols and a priori extrinsic information, each SISO decoder computes a posteriori probabilities. The BCJR SISO decoder first computes the branch metrics $\gamma$. Then it computes the forward metrics $\alpha_k$ and backward metrics $\beta_k$ between two trellis states $s'$ and $s$ [9].

$$\alpha_k(s) = \max_{(s',s)} \left(\alpha_{k-1}(s') + \gamma_k(s',s)\right) \qquad (7)$$

$$\beta_k(s) = \max_{(s',s)} \left(\beta_{k+1}(s') + \gamma_{k+1}(s',s)\right) \qquad (8)$$

where

$$\gamma_k(s',s) = \gamma_k^{Sys}(s',s) + \gamma_k^{Parity}(s',s) + \gamma_k^{Ext}(s',s) \qquad (9)$$

The soft output information $so(d_k = c_p c_{p+1})$ and the symbol-level extrinsic information $z(d_k = c_p c_{p+1})$ of symbol k are then computed using equations (10) and (11). The extrinsic information, which is exchanged iteratively between the two SISO decoders, is obtained by subtracting the intrinsic information from $so(d_k = c_p c_{p+1})$.

$$so(d_k) = \max_{(s',s)/d(s',s)=d_k} \left(\alpha_{k-1}(s') + \gamma_k(s',s) + \beta_k(s)\right) \qquad (10)$$

$$z(d_k) = \max_{(s',s)/d(s',s)=d_k} \left(\alpha_{k-1}(s') + \gamma_k^{Ext}(s',s) + \beta_k(s)\right) \qquad (11)$$

$z(d_k)$ can be multiplied by a constant scaling factor SF (typically equal to 0.75), giving a modified Max-Log-MAP algorithm that improves the resulting error rate performance.

Finally, in the case of turbo demapping, and in only one of the two SISO decoders, the bit-level extrinsic information of the systematic symbols $c_p c_{p+1}$ is computed using equations (12) and (13). Similar computations are done for the parity symbols $c_{p+2} c_{p+3}$.

$$L_{apr,Dem}(c_p) = \max[z(d_k = 11), z(d_k = 10)] - \max[z(d_k = 01), z(d_k = 00)] \qquad (12)$$

$$L_{apr,Dem}(c_{p+1}) = \max[z(d_k = 11), z(d_k = 01)] - \max[z(d_k = 10), z(d_k = 00)] \qquad (13)$$

These expressions exhibit three main computation steps: (a) branch metrics computation, denoted by $\gamma_k$, (b) state metrics computation, denoted by $\alpha_k$ and $\beta_k$, and (c) extrinsic information computation, denoted by $L_{apr,Dem}$ and z.
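Equations (12) and (13), together with the optional scaling of z, can be written as a short Python sketch. The dictionary keyed by the two-bit symbol value and the function names are illustrative choices, not part of the original description.

```python
def scale_extrinsic(z, sf=0.75):
    """Apply the constant scaling factor SF to the symbol-level extrinsic values."""
    return {d: sf * value for d, value in z.items()}


def symbol_to_bit_extrinsic(z):
    """Bit-level a priori information for the demapper, equations (12) and (13).

    z maps each symbol value d_k = c_p c_{p+1} ('00', '01', '10', '11')
    to its symbol-level extrinsic information.
    """
    l_cp = max(z["11"], z["10"]) - max(z["01"], z["00"])    # equation (12)
    l_cp1 = max(z["11"], z["01"]) - max(z["10"], z["00"])   # equation (13)
    return l_cp, l_cp1
```

Each bit-level value costs two 2-input max operations and one subtraction, which is the operation count used for this step in the complexity evaluation of the SISO decoder.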

III. TBICM-SSD and TBICM-ID-SSD convergence speed analysis

This section illustrates the impact of constellation rotation and iterative demapping on the convergence speed, i.e. on how rapidly the iterative process converges.

EXIT charts [10] are a useful tool for a clear and thorough analysis of the convergence speed. They were first proposed for parallel concatenated codes, and then extended to other iterative processes. For iterative demapping with turbo decoding (TBICM-ID-SSD), the authors in [5] used this tool to analyze the iterative exchange of information between the different SISO components. In this receiver with two iterative processes, the response of the two SISO decoders is plotted while taking into consideration the SISO demapper with updated inputs and outputs.

In this scheme, $I_{A1}$, $I_{A2}$, $I_{E1}$, and $I_{E2}$ designate the a priori and extrinsic information for DEC1 and DEC2 respectively (Fig. 1). Iterations start without a priori information ($I_{A1} = 0$ and $I_{A2} = 0$). Then, the extrinsic information $I_{E1}$ of DEC1 is fed to DEC2 as a priori information $I_{A2}$ and vice versa, i.e. $I_{E1} = I_{A2}$ and $I_{E2} = I_{A1}$. Since this EXIT chart analysis is asymptotic, an infinitely long BICM interleaver should be assumed. The SISO decoder is represented by its transfer function:

$$I_E = T(I_A, E_b/N_0) \qquad (14)$$

Extensive analysis for different $E_b/N_0$ values and different system parameters (modulation orders and code rates) has been conducted and gave similar results. Fig. 2 illustrates one of these simulations for QAM64, code rate |, $E_b/N_0$ = 22 dB, and erasure probability equal to 0.15. The transfer function of the turbo decoder is represented by a two-dimensional chart as follows. One SISO decoder component is plotted with its input on the horizontal axis and its output on the vertical axis. The other SISO component is plotted with its input on the vertical axis and its output on the horizontal axis. The iterative decoding corresponds to the trajectory found by stepping between the two curves. For successful decoding, there must be a clear path between the two curves so that iterative decoding can proceed from 0 to 1 mutual extrinsic information.
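The stepping procedure can be sketched in Python. The two transfer functions below are toy linear curves chosen only to show how a wider tunnel shortens the trajectory. They are not the measured transfer functions of the decoder, and all names are hypothetical.

```python
def exit_trajectory(transfer, target=0.999, max_steps=200, min_gain=1e-6):
    """Step between two identical SISO transfer curves I_E = T(I_A).

    Each entry of the returned path is one SISO decoder activation
    (half a turbo decoding iteration).
    """
    i_a = 0.0  # iterations start without a priori information
    path = []
    for _ in range(max_steps):
        i_e = transfer(i_a)
        path.append((i_a, i_e))
        if i_e >= target or i_e - i_a < min_gain:
            break  # converged, or the tunnel between the curves is closed
        i_a = i_e  # the extrinsic output becomes the other decoder's a priori input
    return path


def wide_tunnel(i_a):
    """Toy transfer curve standing in for a favorable configuration (assumption)."""
    return min(1.0, 0.30 + 0.90 * i_a)


def narrow_tunnel(i_a):
    """Toy transfer curve standing in for a less favorable configuration (assumption)."""
    return min(1.0, 0.10 + 0.95 * i_a)
```

With these toy curves, `len(exit_trajectory(wide_tunnel))` is much smaller than `len(exit_trajectory(narrow_tunnel))`: the wider the tunnel, the fewer decoder activations are needed to reach full mutual information.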



(1) Trajectory of TBICM-SSD

(2) Trajectory of TBICM-ID-SSD

Fig. 2: EXIT chart analysis at $E_b/N_0$ = 22 dB of the double binary turbo decoder for iterations to the QAM64 demapper. Code rate | is considered for transmission over a Rayleigh fast-fading channel with erasure probability equal to 0.15.

The solid curves of Fig. 2 correspond to the EXIT charts for the case with rotated constellation, while the dashed curves correspond to the case with no rotation. Furthermore, the red curves correspond to non-iterative demapping, i.e. the SISO demapper executed once in a feed-forward scheme. Applying demapping iterations corresponds to the other colored curves in the EXIT charts of Fig. 2.

In this figure, we observe that the EXIT tunnel is wider for the rotated case than for the non-rotated one. Furthermore, in the latter case the tunnel is limited to that of one demapping iteration. Thus, performing more demapping iterations will not affect the convergence speed of non-rotated constellation configurations. With the rotated constellation, however, the tunnel keeps widening (improving) until three demapping iterations. For TBICM-SSD, the EXIT charts show that more than 6 turbo decoding iterations are needed to reach convergence following trajectory (1), whereas 4 demapping iterations are sufficient following trajectory (2).

Thus, in the case of TBICM-ID-SSD, the iteration scheduling that optimizes the convergence is the one that widens the EXIT tunnel as soon as possible. Analyzing the different tunnel curves in the EXIT figure shows that the tunnel widens with each demapping iteration. Thus, the optimized scheduling is to execute only one turbo decoding iteration for each demapping iteration and then step forward to the next demapping iteration (widening the EXIT tunnel). This scheduling is the one adopted implicitly in [8]. Note that after the third demapping iteration, only a slight improvement in convergence is observed. Similar results have been found for all considered modulation orders, code rates, and erasure coefficients. This result will be used in the next section to reduce the number of demapping iterations.

IV. Reducing the number of demapping iterations 

As mentioned in the previous section, the optimized profile applies one turbo code iteration for each demapping iteration. Thus, reducing the number of turbo demapping iterations reduces the total number of iterations of the turbo decoder.

However, various EXIT charts constructed with different parameters show that after a specific number of demapping iterations, only a slight improvement is predicted. As an example, in Fig. 2 the decoder transfer functions coincide with each other after 3 demapping iterations. However, one can notice that turbo decoding iterations must continue until the two constituent decoders agree with each other. Thus, the number of demapping iterations can be reduced without affecting error rates, while keeping the same total number of turbo decoding iterations. This constitutes the basis for our proposed original iteration scheduling.

In fact, to keep the number of decoder iterations unaltered, one turbo code iteration is added after the last demapping iteration for each eliminated demapping iteration. Fig. 3 shows simulation results for six turbo demapping iterations performing one turbo decoding iteration each. Hence, six turbo code iterations are performed in total. This scheme is denoted 6IDem.

With the proposed iteration scheduling, 5IDem_1EIDec designates five demapping iterations (one turbo code iteration is applied for each) followed by one extra turbo code iteration.
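The naming convention and the resulting order of operations can be made concrete with a small Python sketch. The step labels "demap" and "decode" and the function names are illustrative only.

```python
def build_schedule(n_demap, n_extra_dec=0):
    """List the receiver steps of an <n_demap>IDem_<n_extra_dec>EIDec scheme.

    Each demapping iteration is followed by one turbo decoding iteration, and
    the extra turbo decoding iterations run after the last demapping iteration.
    """
    steps = []
    for _ in range(n_demap):
        steps += ["demap", "decode"]
    steps += ["decode"] * n_extra_dec
    return steps


def schedule_label(n_demap, n_extra_dec=0):
    """Name of the scheme, e.g. 6IDem or 5IDem_1EIDec."""
    suffix = f"_{n_extra_dec}EIDec" if n_extra_dec else ""
    return f"{n_demap}IDem{suffix}"
```

For instance, `build_schedule(6)` and `build_schedule(4, 2)` both contain six "decode" steps, but the second contains only four "demap" steps, which is where the latency and complexity savings come from.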

Referring to Fig. 3, the error rates associated with 6IDem and 5IDem_1EIDec show almost the same performance, while one feedback to the demapper is eliminated in the latter scheme. Similarly, for 4IDem_2EIDec, two feedbacks to the demapper are eliminated. A slight loss of 0.025 dB is induced. Eliminating more demapping iterations causes significant performance degradation: 3IDem_3EIDec is closer to 5IDem than to 6IDem.

In fact, 4IDem_2EIDec represents the most optimized curve for the 6IDem performance scheme, as shown in Fig. 3. At first sight, EXIT charts do not agree with this observation, since three demapping iterations were sufficient to achieve the same correction as the eight iterations. EXIT charts are based on average calculations, as many frames are simulated.

Fig. 3: BER performance comparison for TBICM-ID-SSD for the transmission of 1536 information bits frame over Rayleigh fast-fading channel without erasure. QAM64 modulation scheme with code rate | are considered.



Performance loss (dB), from $R_c$ = 6/7 to $R_c$ = 1/2

Modulation scheme | Without erasure | With erasure
QPSK | 0.02 → 0.03 | 0.02 → 0.05
QAM16 | 0.04 → 0.06 | 0.04 → 0.08
QAM64 | 0.05 → 0.08 | 0.07 → 0.12
QAM256 | 0.07 → 0.10 | 0.09 → 0.15

TABLE I: Performance loss for different modulation schemes and code rates after 2 omitted demapping iterations.

The three demapping iterations represent the average number of demapping iterations needed to ensure that the two constituent decoders agree with each other. Performing more demapping iterations provides more error correction. Further simulations show a performance loss of 0.02 dB to 0.1 dB without erasures and of 0.02 dB to 0.15 dB with erasure events when the proposed scheduling is applied. Table I summarizes the performance loss for different code rates and constellation orders after omitting two demapping iterations. These values were obtained for the worst case, corresponding to 3IDem_2EIDec in comparison to 5IDem. Note that in the error floor region, simulations show almost identical BER performance when more than 3 demapping iterations are applied. Furthermore, it is worth noting that with a limited-diversity channel model, omitting 2 demapping iterations leads to a slightly lower performance loss than that of Table I for a fast-fading channel model. In fact, one demapping iteration with a high-diversity channel model provides more error correction than one iteration executed with a limited-diversity one. Simulations conducted with a block-fading channel model have confirmed this result.

Using this technique, the latency and complexity issues caused by TBICM-ID-SSD are reduced. Two feedbacks to the demapper, with the associated delays, computations, and memory accesses, are eliminated. It is worth noting that the proposed new scheduling does not have any impact on the receiver area (logic or memory). This scheduling is applied to a TBICM-ID-SSD receiver and provides a complexity reduction in the "temporal dimension" (which impacts power consumption, throughput, and latency). Complexity reductions will be evaluated and discussed in the next section.

V. Complexity evaluation and normalization 

The main motivation behind the convergence speed analysis and the proposed technique for reducing the number of iterations is to improve the receiver implementation quality. In order to appreciate the achieved improvements, an accurate evaluation of the complexity in terms of number and type of operations and memory accesses is required. Such a complexity evaluation is fair and general, as it is independent of the architecture mode (serial or parallel) and remains valid for both. In fact, all architecture alternatives must execute the same number of operations (serially or concurrently) to process a received frame. In this section, we consider the two main blocks of the TBICM-ID-SSD system configuration, which are the SISO demapper and the SISO decoder. The proposed evaluation considers the low-complexity algorithms presented in section II. A typical fixed-point representation of channel inputs and various metrics is considered. Table II summarizes the total number of required quantization bits for each parameter.

Block | Parameter | Number of bits
SISO demapper | Received complex input $(x^I_{r,q}, x^Q_{r,q})$ | (10,10)
SISO demapper | Fading coefficient and variance $(h_q)/(\sigma^2)$ | 8
SISO demapper | Constellation complex symbol $(s^I, s^Q)$ | (12,12)
SISO demapper | Euclidean distance $A_q$ | 19
SISO decoder | Received 4 LLRs | 4 × 5
SISO decoder | Branch metric $\gamma_k$ | 10
SISO decoder | State metric $\alpha_k$, $\beta_k$ | 10
SISO decoder | Extrinsic information z | 10

TABLE II: Typical quantization values.

Arithmetic operations ($n_2 > n_1$) | Normalized arithmetic operations
1 Add($n_1$, $n_2$) | 0.5 × ($n_1$ + $n_2$ - 1) Add(1,1)
1 Sub($n_1$, $n_2$) | 0.5 × ($n_1$ + $n_2$) Add(1,1)
1 Mul($n_1$, $n_2$) | [($n_1$ - 1)($n_2$ - 1) + 1 - 0.5 × $n_1$] Add(1,1)

TABLE III: Complexity normalization in terms of Add(1,1).

SISO rotated demapper with a priori input: number and type of operations per modulated symbol per turbo demapping iteration

Computation unit | Operations
Euclidean distance | $2^M$ Add(18,18) + $2^{M+1}$ Sub(8,10) + $2^{M+1}$ Mul(8,8) + $2^{M+1}$ Mul(8,10) + 2 load(10) + (1 + $2^{M+1}$) load(8)
a priori adder | ($2^M$ - 2){E[(M-1)/2] Add(8,8) + E[(M-1)/4] Add(9,9) + E[(M-1)/8] Add(10,10) + M Sub(8,11) + M Sub(11,19)} + M load(8) + $2^M$ load(M)
a priori adder (QPSK) | M($2^M$ - 2) Sub(11,19) + M load(8) + $2^M$ load(M)
Minimum finder | M Sub(8,8) + M $2^M$ Sub(8,19) + M store(8)

SISO double binary turbo decoder: number and type of operations per coded symbol per turbo decoding iteration

Computation unit | Operations
Branch metric | 4 Add(5,5) + 38 Add(5,10) + 4 Sub(5,5) + 8 load(5) + 6 load(10)
State metric | 64 Add(10,10) + 48 Sub(9,9) + 8 store(10)
Extrinsic information | 32 Add(10,10) + 32 Sub(9,9) + 9 Sub(10,10) + 3 Mul(4,10) + 8 load(10) + 5 store(10)

TABLE IV: Complexity computation summary.

A. Complexity evaluation of SISO demapper 

The complexity of SISO demapping depends on the modulation order (in the context of the above fixed parameters). We now consider the equations of subsection II-B to compute: (1) the required number and type of arithmetic computations and (2) the required number of read memory accesses (load) and write memory accesses (store). The result of this evaluation is summarized in Table IV and explained below. We use the following notation: operation(NbOfBitsOfOperand1, NbOfBitsOfOperand2) for arithmetic operations, and load(NbOfBits)/store(NbOfBits) for read/write memory operations. Thus, Add(8, 10) indicates an addition of two operands, one quantized on 8 bits and the other on 10 bits. Similarly, load(8) indicates a read memory access of 8-bit word length.

1) Euclidean distance computation

For each modulated symbol (input of the demapper):

• One load(8) to access the fading channel coefficient normalized by the channel variance

• Two load(10) to access the channel symbols $x^I_{r,q}$ and $x^Q_{r,q}$

• For each one of the $2^M$ symbols of the constellation $(s^I, s^Q)$:

- Two load(8) to access the constellation symbols $s^I$ and $s^Q$

- Two Sub(8, 10) to compute $(x^I_{r,q} - s^I)$ and $(x^Q_{r,q} - s^Q)$

- Two Mul(8, 10) to multiply with the channel coefficients of the I and Q components

- Two Mul(10, 10) to compute the square of the results above

- One Add(18, 18) to realize the sum of the two Euclidean distance terms

2) A priori adder

For each modulated symbol (input of the demapper):

• M load(8) to access the a priori information $L_{apr,Dem}(c_{p,q})$

• For each one of the $2^M$ symbols of the constellation $(s^I, s^Q)$, except the two symbols corresponding to all zeros and all ones:

- One load(M) to access the constellation symbol bits $c_{k,q}$, k = 0, 1, ..., M - 1

- E[(M-1)/2] Add(8, 8) to realize the sum of two $L_{apr,Dem}(c_{p,q})$

- E[(M-1)/4] Add(9, 9) to realize the sum of four $L_{apr,Dem}(c_{p,q})$

- E[(M-1)/8] Add(10, 10) to realize the sum of eight $L_{apr,Dem}(c_{p,q})$

- M Sub(8, 11) to subtract the LLR of the specific p-th bit and thus obtain $B_{p,q}$

- M Sub(11, 19) to realize $A_q - B_{p,q}$

E[x] here represents the ordinary rounding of the positive number x to the nearest integer.

However, for the simple QPSK modulation the above operations can be simplified, as only 2 LLRs exist for one modulated symbol. In fact, in equation (6) there is no need to execute an addition followed by a subtraction of the same LLR. Thus, the total number of required arithmetic operations in this case is 4 Sub(11, 19).

3) Minimum finder

For each one of the M bits per modulated symbol:

• $2^M$ Sub(8, 19) to realize the two min operations of equation (4)

• One Sub(8, 8) to subtract the two minimum values found above, resulting in the demapper extrinsic information

• One store(8) to store the extrinsic information value

B. Complexity evaluation of SISO decoder 

The SISO decoder complexity is composed of 3 principal units: branch metric, state metric, and extrinsic information functions. As for the SISO demapper, the result of the complexity evaluation is summarized in Table IV and explained below. As stated before, the considered turbo code is an 8-state double binary one. At the turbo decoder side, each double binary symbol should be decoded to take a decision over the 4 possible values (00, 01, 10, 11).

1) Branch metrics ($\gamma$)

For each coded symbol (input of the decoder):

• 4 load(5) to access systematic and parity LLRs

• 3 load(10) to access demapper normalized extrinsic information

• 2 Add(5, 5) and 2 Sub(5, 5) to compute the systematic and parity branch metrics $\gamma^{Sys}_{11}$, $\gamma^{Sys}_{10}$, $\gamma^{Parity}_{11}$, and $\gamma^{Parity}_{10}$

• 19 Add(5, 10) to combine the extrinsic, systematic, and parity terms into the branch metrics $\gamma_k$

The operations above should be multiplied by 2 to generate forward and backward branch metrics.

2) State metrics ($\alpha$, $\beta$)

For each coded symbol (input of the decoder):

• 32 Add(10, 10) to compute $\alpha_{k-1}(s') + \gamma_k(s', s)$ for the 32 trellis transitions (8-state double binary trellis)

• 24 Sub(9, 9) to realize the 8 max (4-input) operations of equation (7). In fact, 1 max (N-input) can be implemented as N-1 max (2-input) operations, and 1 max (2-input) corresponds to 1 Sub

• 8 store(10) to store computed state metrics only for left butterfly algorithm

The operations above should be multiplied by 2 to generate forward $\alpha$ and backward $\beta$ state metrics.
3) Extrinsic information (z)

For each coded symbol (input of the decoder):

• 8 load(10) to access state metric values

• 32 Add(10, 10) to compute the second required addition operation in equation (10) for the 32 trellis transitions

• 28 Sub(9, 9) to realize the 4 max (8-input) operations of equation (10)

• 4 Sub(10, 10) to subtract symbol-level intrinsic information from the computed soft value (generating symbol-level extrinsic information)

• 8 Sub(9, 9) and 4 Sub(10, 10) to realize the 8 max (2-input) operations and compute the 4 bit-level (systematic and parity) extrinsic information values used as demapper a priori information (equations (12) and (13)). This computation is done only for one of the two SISO decoders

• 4 store(10) to store the computed bit-level (systematic and parity) extrinsic information

• 3 Sub(10, 10) to normalize symbol-level extrinsic information by subtracting the one related to decision 00

• 3 Mul(4, 10) to multiply the symbol-level extrinsic information by a scaling factor SF

• 3 store(10) to store the computed DEC1 symbol-level extrinsic information as DEC2 a priori information

C. Complexity normalization 

The complexity analysis above exhibits different arithmetic and memory operation types and operand sizes. In order to provide a fair evaluation of the improvement in complexity and memory access obtained with the technique proposed in section IV, complexity normalization is necessary.

For arithmetic operations, normalization can be done in terms of 2-input one-bit full adders (Add(1, 1)). Each of the adders, subtractors, and multipliers can be converted into an equivalent number of Add(1, 1). For adders and subtractors, bit-to-bit half and full adders are used and generalized for operand sizes $n_1$ and $n_2$. The obtained formulas are summarized in Table III, with a simple yet accurate analysis of all corner cases. Similarly, multiplication operations are normalized using successive addition operations. Memory access operations of m words of size n are normalized to one memory access operation of m × n bits.
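The normalization rules of Table III can be checked with a few lines of Python. The sketch below applies them to the SISO decoder operation counts of Table IV. The function names are illustrative, and the multiplier formula is read with the smaller operand size $n_1$ in its last term, which is an interpretation of the table.

```python
def add(n1, n2):
    """Add(n1, n2) expressed in Add(1,1) units."""
    return 0.5 * (n1 + n2 - 1)


def sub(n1, n2):
    """Sub(n1, n2) expressed in Add(1,1) units."""
    return 0.5 * (n1 + n2)


def mul(n1, n2):
    """Mul(n1, n2) expressed in Add(1,1) units, n1 being the smaller operand size."""
    return (n1 - 1) * (n2 - 1) + 1 - 0.5 * n1


# SISO decoder arithmetic per coded symbol per turbo decoding iteration (Table IV)
branch_metric = 4 * add(5, 5) + 38 * add(5, 10) + 4 * sub(5, 5)
state_metric = 64 * add(10, 10) + 48 * sub(9, 9)
extrinsic = 32 * add(10, 10) + 32 * sub(9, 9) + 9 * sub(10, 10) + 3 * mul(4, 10)
```

Under these rules the three decoder units evaluate to 304, 1040, and 760 Add(1,1) respectively.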

Applying the proposed complexity normalization approach to Table IV leads to the results summarized in Table V.

VI. Discussions and achieved gains 

This section evaluates and discusses the complexity reductions achieved with the proposed original iteration
scheduling of TBICM-ID-SSD at different modulation orders and code rates. As concluded in Section IV, two
demapping iterations can be eliminated while keeping the number of turbo decoding iterations unaltered. Overall,
this removes two executions of the SISO demapping function. Besides the modulation order and the code rate, the
results depend on a third parameter: the implementation choice for iterative demapping. In this regard, two
configurations are analyzed. In the first configuration, denoted CASE 1, the Euclidean distances are recalculated
at each demapping iteration. In the second configuration, denoted CASE 2, the Euclidean distances are computed
only once, at the first iteration, then stored and reused in later demapping iterations. Thus, CASE 1 implies more
arithmetic computations, but fewer memory accesses, than CASE 2.
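
The difference between the two configurations can be illustrated with a toy sketch. It is not the demapper analyzed in this paper: it uses an unrotated QPSK constellation, ignores fading, a priori information and signal space diversity, and only counts how often the distances are recomputed or read back. All class and function names are hypothetical.

```python
def squared_distances(received: complex, constellation: list) -> list:
    """Squared Euclidean distance from one received symbol to each constellation point."""
    return [abs(received - point) ** 2 for point in constellation]


class Case1Demapper:
    """CASE 1: distances are recomputed at every demapping iteration; nothing is stored."""

    def __init__(self, constellation):
        self.constellation = constellation
        self.distance_passes = 0

    def distances(self, received, iteration):
        self.distance_passes += 1
        return squared_distances(received, self.constellation)


class Case2Demapper:
    """CASE 2: distances are computed and stored at iteration 0, then read back afterwards."""

    def __init__(self, constellation):
        self.constellation = constellation
        self.distance_passes = 0
        self.loads = 0
        self.stored = None

    def distances(self, received, iteration):
        if iteration == 0:
            self.stored = squared_distances(received, self.constellation)
            self.distance_passes += 1
        else:
            self.loads += 1
        return self.stored


qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]   # toy, unrotated constellation
received = 0.8 + 1.1j
case1, case2 = Case1Demapper(qpsk), Case2Demapper(qpsk)
for it in range(6):
    d1 = case1.distances(received, it)
    d2 = case2.distances(received, it)
print(case1.distance_passes, case2.distance_passes, case2.loads)  # 6 1 5
```

Over six demapping iterations, the CASE 1 object computes the distances six times, whereas the CASE 2 object computes them once and reads them back five times, which is the arithmetic versus memory trade-off described above.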

SISO rotated demapper with a priori input

| Computation unit | Number and type of operations per modulated symbol per turbo demapping iteration |
|---|---|
| Euclidean distance | $123.75 \cdot 2^{M+1}$ Add(1, 1) + load($28 + 2^{M+3}$) |
| A priori adder | $(2^M - 2)$ (1.5E[^^] + ^.5E[M^] + 9.5E[^^] + 24.5.M) Add(1, 1) + load($8M$) + load($M \cdot 2^M$) |
| A priori adder (QPSK) | $15 \cdot M \cdot (2^M - 2)$ Add(1, 1) + load($8M + M \cdot 2^M$) |
| Minimum finder | $(8 + 13.6 \cdot 2^M) \cdot M$ Add(1, 1) + store($8M$) |

SISO double binary turbo decoder

| Computation unit | Number and type of operations per coded symbol per turbo decoding iteration |
|---|---|
| Branch metric | 304 Add(1, 1) + load(100) |
| State metric | 1040 Add(1, 1) + store(80) |
| Extrinsic information | 160 Add(1, 1) + load(80) + store(50) |

TABLE V: Complexity computation summary after normalization. 



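CASE 1: with recomputed Euclidean distances. CASE 2: with stored Euclidean distances. Each entry is the complexity
reduction in arithmetic operations (arith), read memory accesses (load), or write memory accesses (store).

| Modulation scheme | CASE 1, $R_c = 1/2$: arith | load | store | CASE 1, $R_c = 6/7$: arith | load | store | CASE 2, $R_c = 1/2$: arith | load | store | CASE 2, $R_c = 6/7$: arith | load | store |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QPSK | 11.9% | 10.6% | 3.7% | 8.2% | 7.1% | 2.2% | 2.5% | 12% | 3.4% | 1.6% | 8.2% | 2.1% |
| QAM16 | 20.3% | 13.7% | 3.7% | 15.9% | 9.7% | 2.2% | 11.6% | 18.1% | 3.1% | 8.3% | 13.4% | 2% |
| QAM64 | 27.9% | 21.4% | 3.7% | 25% | 17.1% | 2.2% | 21.8% | 26.5% | 2.5% | 18.5% | 22.3% | 1.7% |
| QAM256 | 31.6% | 28.4% | 3.7% | 30.5% | 25.7% | 2.2% | 27.6% | 32.2% | 1.5% | 26.2% | 30% | 1.2% |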



TABLE VI: Reduction in the number of arithmetic operations and of read/write memory accesses when comparing
"4IDem_2EIDec" to "6IDem", for different modulation schemes and code rates.

Using the normalized complexity evaluation of Table V, the gains achieved by 4IDem_2EIDec compared to 6IDem are
summarized in Table VI for all configurations. In the following, we first explain how these values are computed
and then discuss the obtained results.

Considering the code rate $R_c$ and the number of bits per symbol $M$, the relation between the number of
double binary coded symbols ($N_{CodedSymb}$) and the corresponding number of modulated symbols ($N_{ModSymb}$) can
be written as follows.

$$N_{ModSymb} = \frac{N_{CodedSymb}}{M \cdot R_c} \tag{15}$$



The complexity reduction ($G$) corresponds to the ratio between the complexity of two SISO demapping executions
and the complexity of the original TBICM-ID-SSD configuration. If the original TBICM-ID-SSD configuration requires
$N_{it}$ iterations to process a frame composed of $N_{ModSymb}$ modulated symbols (equivalent to $N_{CodedSymb}$
coded symbols), the complexity reduction can be approximated by the following expression.

$$G = \frac{2 \cdot F_{Dem}(M) \cdot N_{ModSymb}}{N_{it} \cdot F_{Dem}(M) \cdot N_{ModSymb} + N_{it} \cdot F_{Dec} \cdot N_{CodedSymb}} \tag{16}$$

where $F_{Dem}$ designates the complexity of the SISO demapper, which depends on the constellation size, and
$F_{Dec}$ designates the complexity of the SISO decoder. Converting the number of modulated symbols in this equation
into the equivalent number of coded symbols using equation (15), we obtain the following equation.

$$G = \frac{2 \cdot F_{Dem}(M)}{N_{it} \cdot F_{Dem}(M) + N_{it} \cdot F_{Dec} \cdot M \cdot R_c} \tag{17}$$



This last equation has been used to obtain individually the complexity reductions in terms of arithmetic, read
memory access, and write memory access operations of Table VI for $N_{it} = 6$. For CASE 1, results show increased
benefits in terms of number of arithmetic operations (up to 31.6%) and read memory accesses (up to 28.4%) with
higher modulation orders. This follows directly from equation (17), as the value of $F_{Dem}$ increases with the
constellation size. The equation also shows that the higher the code rate, the lower the benefits.
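
The evaluation of equation (17) can be sketched as follows for the read memory accesses of CASE 1. The demapper load terms are an assumed reading of Table V, namely load($28 + 2^{M+3}$), load($8M$) and load($M \cdot 2^M$) per modulated symbol, and the decoder load terms are taken as load(100) + load(80) per coded symbol. The function names are hypothetical.

```python
def complexity_reduction(f_dem: float, f_dec: float, m: int, rc: float, n_it: int = 6) -> float:
    """Equation (17): gain of removing two demapping executions, as a fraction."""
    return 2 * f_dem / (n_it * f_dem + n_it * f_dec * m * rc)


def demapper_loads(m: int) -> int:
    """Assumed CASE 1 read bits per modulated symbol (reading of Table V)."""
    return (28 + 2 ** (m + 3)) + 8 * m + m * 2 ** m


DECODER_LOADS = 100 + 80   # branch metric + extrinsic information, per coded symbol

for name, m in [("QPSK", 2), ("QAM16", 4), ("QAM64", 6), ("QAM256", 8)]:
    gains = [100 * complexity_reduction(demapper_loads(m), DECODER_LOADS, m, rc)
             for rc in (1 / 2, 6 / 7)]
    print(f"{name:7s} load reduction: Rc=1/2 {gains[0]:.1f}%  Rc=6/7 {gains[1]:.1f}%")
```

With these assumed terms, the printed values round to the CASE 1 load columns of Table VI, which also shows how the gain approaches its upper bound $2/N_{it}$ as $F_{Dem}$ grows with the constellation size.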
On the other hand, the improvement in write memory access (3.7% for $R_c = 1/2$ and 2.2% for $R_c = 6/7$) is
low and constant for all modulation orders. In fact, in Table V, the single memory store term which depends on the
modulation order is store($8M$) for the minimum finder computation. This term is required per modulated symbol;
when converted to the equivalent number per coded symbol (equation (15)) for a fixed code rate, a constant value,
independent of $M$, is obtained.
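
This independence from $M$ can be checked by substituting the store terms into equation (17). Taking the store terms of Table V as read here, $8M$ bits per modulated symbol for the demapper and $80 + 50 = 130$ bits per coded symbol for the decoder, $M$ cancels out:

$$G_{store} = \frac{2 \cdot 8M}{N_{it} \cdot 8M + N_{it} \cdot 130 \cdot M \cdot R_c} = \frac{16}{N_{it} \, (8 + 130 \, R_c)}$$

For $N_{it} = 6$ this gives $16/438 \approx 3.7\%$ at $R_c = 1/2$ and $16/716.6 \approx 2.2\%$ at $R_c = 6/7$, in line with the CASE 1 store columns of Table VI.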

Similar behavior is observed for CASE 2, except for two points. The first one concerns the improvements in
arithmetic operations and read memory accesses. Compared to CASE 1, this configuration implies fewer arithmetic
operations and more memory accesses, which leads to lower benefits for the former and higher benefits for the
latter (equation (17)). The second point concerns the improvement in write memory access. Besides the term of
$M \times 8$ bits, a value of $19 \times 2^M$ bits is required, only for the first iteration, to store the $2^M$
Euclidean distances quantized on 19 bits each. This added value is much higher than the saved $M \times 8$ bits of
write memory access. Therefore, the improvement in write memory access operations decreases for higher
constellation sizes (down to 1.2%).
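
As a rough check of this argument, the sketch below adds the one-off $19 \times 2^M$ bits to the write accesses of the original configuration and keeps the saving at two demapping executions of $8M$ bits each. This accounting is an assumption about how the CASE 2 store column was obtained, as are the decoder store term of 130 bits per coded symbol (80 + 50, as read from Table V) and the function name.

```python
def case2_store_reduction(m: int, rc: float, n_it: int = 6) -> float:
    """Write-access gain for CASE 2, counted per modulated symbol, as a fraction."""
    saved = 2 * 8 * m                       # two removed demapping executions, store(8.M) each
    demapper = n_it * 8 * m + 19 * 2 ** m   # n_it minimum-finder stores + one-off distance store
    decoder = n_it * (80 + 50) * m * rc     # decoder stores, converted with equation (15)
    return saved / (demapper + decoder)


for name, m in [("QPSK", 2), ("QAM16", 4), ("QAM64", 6), ("QAM256", 8)]:
    row = [round(100 * case2_store_reduction(m, rc), 1) for rc in (1 / 2, 6 / 7)]
    print(name, row)
```

With $N_{it} = 6$, the values obtained this way round to the CASE 2 store entries of Table VI, from 3.4% for QPSK at $R_c = 1/2$ down to 1.2% for QAM256 at $R_c = 6/7$.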

It is worth noting that combining the proposed scheduling with an early stopping criterion might diminish the
benefit of the scheduling, while the stopping criterion itself comes at the cost of additional complexity.

VII. Conclusion 

Convergence speed analysis is crucial in TBICM-ID-SSD systems in order to tune the number of iterations when
considering practical implementation perspectives. The conducted analysis has demonstrated that omitting two turbo
demodulation iterations without decreasing the total number of turbo decoding iterations leads to promising
complexity reductions while keeping error rate performance almost unaltered. A maximum loss of 0.15 dB is shown
for all modulation schemes and code rates in a fast-fading channel with and without erasure. The reduction in the
number of normalized arithmetic operations ranges from 8.2% for the QPSK configuration to $2/N_{it}$ for QAM256
(e.g. for $N_{it} = 6$ this gives a reduction of 33.3%). Similarly, the reduction in the number of read memory
accesses ranges from 8.2% to $2/N_{it}$. This complexity reduction significantly improves latency and power
consumption, and thus paves the way towards the adoption of TBICM-ID-SSD hardware implementations in future
wireless receivers. Future work targets the extension of this analysis to other baseband iterative applications
and its integration into available hardware prototypes.